1 Executive Summary

1.1 Goals and Introduction

You work for a large corporation that owns a collection of restaurants of different types. Currently it is evaluating the location, type, characteristics (cuisine, price point, design, marketing strategy) and positioning of a new restaurant in Edinburgh (if you know the local habits, you can give better insight). Your task is to analyse the dataset and give recommendations on strategy, based on the reviews, location (either neighbourhood or zipcode level) and competition. Also give estimates on volumes and revenues of the potential undertaking. You may have to check out what the price-range attributes signify by visiting the Yelp website.

1.2 Methodology

This analysis is conduced for Edinburgh only. This was because extensive demographic data was found that could be easily integrated with the Yelp dataset. The team also had better knowledge of the zip codes as Edinburgh is still in the UK. This report is split into three main analyses, then concluded with an estimation of revenues and a summary.

Topic Modelling Analysis The reviews of Edinburgh’s resturants are grouped into topics using a NMF topic modelling method to look for potential topics or opportunities to exploit.

Demand and Location Analysis Summary statistics such the number of reviews per resturant and number of reviews by cuisine are mapped out, to give an indication of relative demand by location, which allows for a recommendation of location.

Competition and Style Analysis A metric of competition is computed for different styles of resturant to recommend the appropriate attributes of the resturant.

2 Exploration of Reviews

This analysis looks through reviews for topics or opportunities to take into account when opening a resturant in Edinburgh. For a relatively small collection of reviews, it may be possible to manually inspect and classify the contents of reviews into specific categories based on similarity, but this becomes time consuming for a dataset of 14,000+ reviews.

This section uses topic modelling to identify the topics users care about most when giving their rating stars, ultimately determining what a new restaurant should be doing in order to receive high ratings.

A non-negative matrix factorization (NMF) approach is used to detect topics from reviews. NMF simultaneously performs dimensionality reduction and clustering to identify semantic features in a document collection and groups the documents into clusters on the basis of shared semantic features [1]. The extracted topics from 1-star and 2-star reviews were used as an indicator of bad practice whereas extracted topics from 4-star and 5-star reviews were used as an indicator of good practice for operating a restaurant.

2.1 Description of datasets

Import the datasets.

business = pd.read_csv("edinburgh.csv",header=0)
checkin = pd.read_csv("edinCheckin.csv",header=0)
review = pd.read_csv("edinReview.csv",header=0)
tip = pd.read_csv("edinTip.csv",header=0)
user = pd.read_csv("edinUser.csv",header=0,usecols=range(0,23))

The main dataset in this study was ‘yelp_academic_dataset_review.json’ and summary of the dataset is shown below.

# Replace nan with blank space
review=review.replace(np.nan,' ', regex=True)

# Summary of dataset
review.info()

The text of reviews is full of punctuations, numbers and capital letters, further cleansing of data is required for text analysis.

review.text.head(10)

Data Cleansing

Text after cleaning up the text in review dataset:

# Referenced Regular Expression for email cleanup idea: 
def cleanup(text):

    # Make text lower case
    for f in re.findall("([A-Z]+)", text):
        text = text.replace(f, f.lower())
    
    # Remove escape symbols
    text = text.replace('\r', " ")
    text = text.replace('\n', " ")
    
    # Remove all non-ascii characters in the string
    text=unicode(text, 'ascii', 'ignore')

    # Creata a list of reg tools
    cleanuptools = [
    # Dates
    r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)",
    # Removing months
    r"january|february|march|april|may|june|july|august|september|october|november|december",
    # Punctuation and numbers to be removed
    r'[-|.|?|!|,|"|:|;|()|0-9]',
    ]
    
    for tool in cleanuptools:
        text = re.sub(tool," ", text)
    return text


# Constructing a list for stopwords 
stopwords = []

# Add scikit-learn's CountVectorizer's stop list to the created list
stopwords = sk.feature_extraction.text.ENGLISH_STOP_WORDS

# Apply the created functions to clean up text
review.text=review.text.apply(cleanup)

# Cleaned text
review.text[28]

2.2 Inspection of review contents

Distribution of number of words

f, ax = plt.subplots(figsize=(15,7.5))
n, bins, patches = ax.hist(review.text.apply(len),facecolor='black',bins=20)
ax.set_xticks(bins)
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(n, bin_centers):
    percent = '{:.2f}%'.format((float(count) / n.sum())*100)
    ax.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
    xytext=(0, -32), textcoords='offset points', va='top', ha='center')
ax.set_xlabel('Count of words in review')
ax.set_ylabel('Count of reviews')

Only reviews with a considerable amount of contents should be included as the topic modelling will not be able to extract insights from short reviews. The length of reviews was therefore analysed and a minimum requirement on the length of review was implemented. But in general even without cutting out reviews, the number of words are high, more than 85% of reviews have more than 250 words.

The distribution of ratings was also examined to avoid inbalanced datasets.

Reviews VS Stars

review.stars.value_counts(sort=False).plot(kind='bar',color="black")
plt.title('Reviews By Star');

Inituitively, we thought there would be more 1-star and 5-star reviews, however, the data shows otherwise, and the majority of reviews were 4-star and 5-star. It is also important to check the usefulness/reliability of reviews by examining their numbers of votes.

Reviews VS Votes

The majority of reviews have only one vote regardless of vote type (cool, useful or funny). So we decided not to remove reviews based on their numbers of votes (doing so would remove a significant amount of text from this analysis).

Votes for cool

pd.crosstab(review.stars,review.votes_cool[review.votes_cool!=0],margins=True)

Votes for funny

pd.crosstab(review.stars,review.votes_funny[review.votes_funny!=0],margins=True)

Votes for useful

pd.crosstab(review.stars,review.votes_useful[review.votes_useful!=0],margins=True)

2.3 Topic Modelling

Dividing the dataset

Reviews were split into two groups, 1 and 2 star reviews were interpreted as bad reviews and 4 and 5-star reviews were interpreted as good reviews. 3-star reviews were excluded in this analysis due to the mixed expression of customers (three stars could be a good or bad review).

# Split the dataset into 2 categories: 1,2 stars and 4,5 stars

review_bad = review[(review.stars == 1) | (review.stars == 2)]
review_good = review[(review.stars == 4) | (review.stars == 5)]

review_good.text.head(10)

Example of good reviews:

Text feature extraction

A range of functions from the python scikit-learn package was used to extract features (words) from textual reviews content and create a document-term matrix, namely [1]-[3]:

  • tokenizing strings and documenting each possible token in a matrix, where white-space and punctuation were token separators.
  • counting the occurrences of tokens in each document.
  • normalizing and weighting with diminishing importance tokens which occur in the majority of samples / documents.

In a large text corpus, stopwords are common (e.g. “the”, “a”, “is”, “and” in English) and carry little meaningful information about the actual contents of the document. In order to reduce the influence of stop words appearing frequently across the entire corpus, TF-IDF term re-weighting functions normalized the data. TfidfVectorizer function combines the functions of TF-IDF and vectorization (Vectorization is a process combining tokenization, counting and normalization) allowing to build a document-term matrix for the corpus of documents:

#Vectorization
tfidfvectorizer_bad = TfidfVectorizer(max_features=15000, ngram_range=(1, 2), stop_words = stopwords,
                                 strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)
tfidfvectorizer_good = TfidfVectorizer(max_features=15000, ngram_range=(1, 2), stop_words = stopwords,
                                 strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)

#Create term document matrix for separate datasets
term_document_matrix_bad = tfidfvectorizer_bad.fit_transform(review_bad.text)
term_document_matrix_good = tfidfvectorizer_good.fit_transform(review_good.text)

Topic Modelling

NMF (Non-negative Matix Factorization) is a method of unsupervised learning for grouping a collection of documents and finding out abstract topics. The scikit-learn package has an implementation of NMF with NNDSVD initialization.

Nonnegative Double Singular Value Decomposition (NNDSVD) is typically used for overcoming sparseness of data in document-term matrix [8]. Here we set the number of topics to be equal to 25 and run NMF for 200 iterations, then get the factors W and H from the resulting model:

# 1-star + 2-star reviews
nmfmodel_bad = sk.decomposition.NMF(init="nndsvd", n_components=25, max_iter=200).fit(term_document_matrix_bad)
W_bad = nmfmodel_bad.fit_transform(term_document_matrix_bad);
H_bad = nmfmodel_bad.components_
# W (number of reviews,number of topics) and H (number of topics, number of features)
print "Generated factor W of size %s and factor H of size %s for bad reviews" \
% ( str(W_bad.shape), str(H_bad.shape) )

# 4-star + 5-star reviews
nmfmodel_good = sk.decomposition.NMF(init="nndsvd", n_components=25, max_iter=200).fit(term_document_matrix_good)
W_good = nmfmodel_good.fit_transform(term_document_matrix_good);
H_good = nmfmodel_good.components_
# W (number of reviews,number of topics) and H (number of topics, number of features)
print "Generated factor W of size %s and factor H of size %s for good reviews" \
% ( str(W_good.shape), str(H_good.shape) )

2.3.1 Top Terms by each topic

#Write functions to plot graphs which present the top topics discovered by the NMF model and create a list of features generated from the NMF model.

# Create a colour series for graph plotting
def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

# Writting functions to show top words in each topic
def TopTermsByTopic(nmfmodel, features, top):
    for index, topic in enumerate(nmfmodel.components_):
        print "\n Topic {}: \n".format(index+1)
        print "Percentage of Words: {:.2%}\n".format(np.count_nonzero(topic) / 41961.)
        
        top_words = [features[i] for i in nmfmodel.components_[index].argsort()[::-1][:top]]
        topic_words = ' '.join(top_words)

        #Prepare data for horizontal bar charts
        top15_index = nmfmodel.components_[index].argsort()[::-1][:15]
        top15_topic = sorted(topic[top15_index],reverse=False)

        #Prepare data for wordclouds
        wc = WordCloud(max_font_size=80,relative_scaling=.5,width=800,height=500).generate(topic_words)
       
        #Create a space for graphs
        fig, ax = plt.subplots(2,figsize=(12,10))
        rect1 = ax[0].barh(.5 + np.arange(15) + .5, top15_topic, color="black", align="center")
        rect2 = ax[1].imshow(wc.recolor(color_func=grey_color_func, random_state=3))
        
        #Subplot 1 - Horizontal Bar Chart
        ax[0].set_title("Top 15 Terms in Topic {}".format(index + 1))
        ax[0].set_xlabel("Weight")
        ax[0].set_yticks(.5 + np.arange(15)+ .5)
        ax[0].set_yticklabels([features[i] for i in topic.argsort()[::-1][:15]])
        ax[0].grid(True)
        
        #Subplot 2 - WordCloud
        ax[1].axis("off")
        ax[1].set_title("Wordcloud of Topic {}".format(index + 1))
        
        #Show the graphs
        plt.tight_layout()
        plt.show()
        
        
#Extracting the feature names
features_bad= tfidfvectorizer_bad.get_feature_names()
features_good= tfidfvectorizer_good.get_feature_names()

Horizontal barchart of the top 15 highest weighted terms for the most significant topic - in bad and good reviews - discovered by the NMF model. The rest of the terms are shown in a word cloud.

2.3.1.1 Bad Reviews: 1-star and 2-star reviews

TopTermsByTopic(nmfmodel_bad, features_bad, 100)

Example of Topic 1 for bad reviews:

2.3.1.2 Good Reviews: 4-star and 5-star reviews

TopTermsByTopic(nmfmodel_good, features_good, 100)

Example of Topic 1 for good reviews:

2.3.2 Topics in bad and good reviews

A summary of the 25 topics from the bad and good reviews - NMF model. An attempt to interpret the contents of each topic was made, keywords in each topic were manually examined and a description of each topic was then assigned. Experiments suggest 25 topics is optimal, allowing for a clear separation of topics.

2.3.2.1 Bad Reviews: 1-star and 2-star reviews

data_bad = {'Index of Topics':['Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7','Topic 8'
                           ,'Topic 9','Topic 10','Topic 11','Topic 12','Topic 13','Topic 14','Topic 15'
                           ,'Topic 16','Topic 17','Topic 18','Topic 19','Topic 20',
                            'Topic 21','Topic 22','Topic 23','Topic 24','Topic 25']
        ,'Type of Topics':['Bad service from manager and waiter/waitress ', 'Bad Coffee Shop',
                           'Bad Chicken Dishes (Fried, Boiled, Curry)',
                           'Bad Italian Foods (Pizza and Pasta: Toppings, base, sauce, etc.)',
                           'Bad Fried Chips (Soggy Batter)','Bad Burger', 'Unknown Topic',
                           'Bad wait and time management','Bad Experience and Services',
                           'Bad Restaurant (Birthplace of Harry Potter)','Bad Afternoon Tea',
                           'Unfriendly and Rude Staffs','Bad Chinese Sweet and Sour Foods',
                           'Unknown Topic','Bad Mexican Foods','Bad Place for drinks (too quiet)',
                           'Bad Japanese Foods (Tuna, Miso Soup)','Bad Prices','Bad Breakfast (Eggs Benedict)',
                           'Bad Wait and Time Management','Bad Noodles', 'Good Comments in Bad Reviews',
                           'Expensive Place for Tourists', 'Hot Temperature','Bad Thai Foods'                           
                           ]
       } 

topic_table_bad = pd.DataFrame(data_bad)
display(topic_table_bad)

2.3.2.2 Good Reviews: 4-star and 5-star reviews

data_good = {'Index of Topics':['Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7','Topic 8'
                           ,'Topic 9','Topic 10','Topic 11','Topic 12','Topic 13','Topic 14','Topic 15'
                           ,'Topic 16','Topic 17','Topic 18','Topic 19','Topic 20',
                            'Topic 21','Topic 22','Topic 23','Topic 24','Topic 25']
        ,'Type of Topics':['Unknown Topic','Good Quality foods','Good Place and Atmosphere',
                           'Good Bars and Pubs', 'Good Scottish Breakfast',
                           'Good Fish and Chips with nice peas','Good Thai Foods with decent prawn',
                           'Good Indian Foods', 'Good Menu','Good Beef Burger with decent sweet potatoes',
                           'Good Price and Value','Good Sandwiches', 'Good Italian Foods (Pasta and Pizza)',
                           'Good chocolates and ice creams','Good Coffee Shops with Nice Artisan and Expresso',
                           'Good Japanese foods (Bento,Nigiri,Kanpai)','Good Mexican Burritos and Tacos',
                           'Excellent Services', 'Good Afternoon Tea', 'Good BBQ Shops (crackling pork, haggis)',
                           'Unknown German Reviews', 'Friendly Staff','Good Foods', 'Good Vegetarian Restaurants',
                           'Good Potato Shops'
                           ]
       } 

topic_table_good = pd.DataFrame(data_good)
display(topic_table_good)

2.3.3 Recommendation

Opportunities to explore when opening a restaurant in Edinburgh:

  • Customers are not satisfied with the type of batter for fish and chips in Edinburgh, which was complained to be too soggy. (Bad Review Topic 5)
  • Customers are also not particularly happy with chicken dishes in Edinburgh. (Bad Review Topic 3)
  • Foods and drinks at a cafe where JK Rowling visited were not satisfying and received a lot of complaints. (Bad Review Topic 10)
  • Sweet and sour chickens are not satisfying at chinese restaurants in Edinburgh. (Bad Review Topic 13)
  • Thai Foods for lunch are not good (Bad Review Topic 25)
  • Current noodle shops in Edinburgh are bad (Bad Review Topic 21)
  • Good atmosphere, location, friendly staff and price-to-value could enhance customer experience (Good Review Topics 2,3,11 and 22)

Areas to avoid and improve when opening a restaurant in Edinburgh:

  • Customers complained about bad services from rude and unfriendly waiter and waitress (Bad Review Topics 1 and 12)
  • Customers complained about long wait (Bad Review Topic 8)
  • Customers complained about room temperature (Bad Review Topic 24)
  • Customers complained about bars and pubs that are too quiet (Bad Review Topic 16)

For sectors where businesses are doing very well would also mean new entrants could face a big challenge to enter these sectors.

  • Good Review Topic 4 appeared in 16.57% of reviews, that implies bars and pubs are providing very good services in Edinburgh.
  • Good Review Topic 15 appeared in 13.32% of reviews, that means current coffee shops are providing very good coffees and services.
  • BBQ shops are doing very well in Edinburgh. (Good Review Topic 20)
  • Potato shops are doing very well in Edinburgh. (Good Review Topic 25)

In conclusion, the insights which were extracted by topic modelling are very useful for planing openning a new restaurant in Edinburgh, for example, it was clear that customers are not satisfied with the current fish and chips and chicken dishes in Edinburgh, a new restaurant could potentially target these specific sectors to full fill the customer demand. Also, the resturant should be careful with certain issues from bad reviews, for example, unfriendly staff, long waitign times, and uncomfortable room temperatures frequently impacted the customer experience. Finally, sectors that are doing very well in Edinburgh such as bars, pubs, BBQ and Potato shops are areas that a new restraunt may not want to consider to avoid fierce competition.

3 Location and Demand

This section estimates the demand for different restaurants per Edinburgh zip code, and looks at the competition per zip. The following steps were performed:

  1. Data cleaning to extract zip codes and restaurant types.
  2. Integrated external demographic data; population, income, and hotels data with the Yelp data.
  3. Calculate metrics for each area in order to recommend a location and cusine type.

3.1 Cleaning the Data

The pandas Python package was used to filter the yelp data to resturants, check-ins, reviews, tips, and users in Edinburgh. This provided data on 1215 restaurants in Edinburgh registered on Yelp.

Regarding restaurant cuisine type, where many different attributes were used to explain the restaurant category, the most expressive type of cuisine type was assigned. For example a resturant might have “Gastropubs,Bars,Scottish,Nightlife,Restaurants”, so the cuisine keyword “Scottish” might be assigned.

The first three or four letters of each resturant’s zip code was extracted from their address, which enables the joining of external demographic data. For example EH6 or EH12.

The data obtained from the Edinburgh government website [4] provided locality and ward demographic profiles. It contained data on; gender, age, housing, employment, education and professions, income, benefits, health and disability, lifestyle, satisfaction with services, and Scottish Index of Multiple Deprivation data - of which population, income and property prices were of primary interest. However, this data was organized by ward, with each ward containing many zip codes. Figure 1 maps different polling districts and figure 2 maps wards districts.

Figure 1: Polling Districts

“Figure 1: Polling Districts”

Figure 2: Wards

“Figure 2: Wards”

To attribute ward demographic data to each zip an external data set of polling districts from the City of Edinburgh Mapping portal [5] was used. Each polling location has a zip code and a ward. When a zip code makes up the majority of occurances in a ward, the demographic data is all attributed to that zip. For example if the ward “Almond” has 4 zip codes in EH4 and 1 in EH2, EH4 will be considered as ward Almond, and takes the Almond demographic data.

As some zip codes such as EH1 and EH2 are really small, they were not assigned to any single ward using the method above. For these zip codes, population numbers were attributed in proportion to the number of polling stations zip’s in the ward. For example the central city ward in general is made up of EH1 and EH2. So the population for central city ward is assigned evenly into the populations of EH1 and EH2. Income and property prices remained the same.

Data on all the hotels in Edinburgh was also downloaded from booking.com and integrated. This data was cleansed, to get hotel counts by zip code.

import json
import pandas as pd
from glob import glob

def convert(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    for k, v in ob.items():
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob


def convert2(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    for k, v in ob.items():
        if isinstance(v, list):
            ob[k] = ','.join(str(v))
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob

business = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_business.json")])
checkin = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_checkin.json")])
review = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_review.json")])
tip = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_tip.json")])
user = pd.DataFrame([convert2(line) for line in file("yelp_academic_dataset_user.json")])

cusine = ['Scottish', 'Mediterranean','Chinese','French','Italian', 'Thai',  'Indian', 'British', 'European', 'Greek', 'Nepalese', 'German', 'Turkish', 'Mexican', 'Pakistan', 'Seafood', 'Japanese', 'Vegetarian', 'Brazilian', 'American','Spanish','Bakeries', 'Gastropubs', 'Soup','Caterers','Cafes','Smoothies','Bistros','Pubs', 'Coffee & Tea','Burgers', 'Delis','Fast Food', 'Sandwiches', 'Pizza', 'Fish & Chips', 'Polish', 'African', 'Korean', 'Middle Eastern', 'Creperies', 'Brasseries','Chicken Wings'  ]

def cusineCategory(x):
    for cus in cusine:
        category = x.categories.encode('ascii','ignore')
        if cus in category:
            return cus

def getZip(x, name):
    start= x[name].find("EH")
    Zip = x[name][start:start + 4].replace(" ", "")
    return Zip
   

edinburgh = business[business.city == "Edinburgh"]
edinburgh = edinburgh[edinburgh['categories'].str.contains("Restaurants")]
edinburgh['cusine'] = edinburgh.apply(cusineCategory, axis = 1)
edinburgh['zip'] = edinburgh.apply(getZip,args=('full_address',),axis=1)
edinburgh = edinburgh[edinburgh.zip != "EH17"]
edinburghid = pd.DataFrame(edinburgh.business_id)
edinReview = pd.merge(edinburghid,review, how='left' ,on="business_id")
edinCheckin = pd.merge(edinburghid,checkin, how='left' ,on="business_id")
edinTip = pd.merge(edinburghid,tip, how='left' ,on="business_id")

edinUserIds = pd.concat([edinReview.user_id,edinTip.user_id])
uniqueIds = pd.DataFrame(edinUserIds.unique())
uniqueIds.columns = ['user_id']

edinUsers = pd.merge(uniqueIds,user, how = 'left' , on = 'user_id')

edinburgh.to_csv('edinburgh.csv', encoding = 'utf-8')
edinReview.to_csv('edinReview.csv', encoding = 'utf-8')
edinCheckin.to_csv('edinCheckin.csv', encoding = 'utf-8')
edinTip.to_csv('edinTip.csv',encoding = 'utf-8')
edinUsers.to_csv('edinUser.csv',encoding='utf-8')


polling = pd.read_csv('polling.csv')
polling['zip'] = polling.apply(getZip,args=('POSTCODE',),axis=1)
polling = polling[['NEWWARD','zip']]
pollGroup = polling.groupby(['NEWWARD','zip']).size().sort_values(ascending=False)
pollGroup

wardsList = ['Almond' , 'Drum Brae Gyle' ,'Pentland Hills', 'Forth' , 'Forth.1', 'Inverleith' , 'Corstorphine Murrayfield' , 'Sighthill Gorgie' , 'Colinton Fairmilehead' , 'Fountainbridge Craiglockhart' , 'Meadows Morningside' , 'City Centre' ,'City Centre.1', 'Leith Walk' , 'Leith ' , 'Craigentinny Duddingston' , 'Southside Newington' ,'Southside Newington.1','Liberton Gilmerton' , 'Portobello Craigmillar']
wards = pd.read_csv('rawwards.csv')
wards['Forth.1'] = wards['Forth']
wards['Forth.1'][0] = "EH5"
wards['Forth.1'][1:17]= pd.to_numeric(wards['Forth'][1:17], errors='coerce')/2
wards['Forth'][1:17] = wards['Forth.1'][1:17]
wards['Southside Newington.1'] = wards['Southside Newington']
wards['Southside Newington.1'][0] = "EH9"
wards['Southside Newington.1'][1:17]= pd.to_numeric(wards['Southside Newington'][1:17], errors='coerce')/2
wards['Southside Newington'][1:17] = wards['Southside Newington'][1:17]
wards['City Centre.1'] = wards['City Centre']
wards['City Centre.1'][0] = "EH2"
wards['City Centre.1'][1:17]= pd.to_numeric(wards['City Centre'][1:17], errors='coerce')/2
wards['City Centre'][1:17] = wards['City Centre.1'][1:17]
wardsMelt = pd.melt(wards, id_vars=['Indicator'], value_vars=wardsList)


zipCode= ""
wardName = ""
wardsMelt['Zip'] = ""
for i, row in wardsMelt.iterrows():
    if row['Indicator'] == 'Zip':
        zipCode = row['value']
    wardsMelt['Zip'][i] = zipCode
    
wardsMelt = wardsMelt[wardsMelt.Indicator != 'Zip']
wardsMelt.columns = ['indicator','type','value','Zip']
wardsMelt[wardsMelt['type'] == 'City Centre']
wardsMelt.to_csv('edinPopulation.csv')

3.2 Analysis

Figure 3 below shows the number of stores in Edinburgh. Figure 4 shows the type of cuisine that has the most stores. In general, Italian and British cuisine are the most popular. The city centre (zip codes EH1 and EH2) contained the greatest number of restaurants.

Figure 3: Density Map of Restaurants / Figure 4 Number of Restaurants by Cuisine

“Figure 3: Density Map of Restaurants / Figure 4 Number of Restaurants by Cuisine”

The relationship between cuisine type and the number of reviews (demand) was also inspected (figure 5).

Figure 5: Cuisines with Reviews

“Figure 5: Cuisines with Reviews”

Figure 5 shows that British food has a very high number of reviews in Edinburgh, despite there being more Italian resturants than British (figure 4). It suggests that British restaurants are quite popular and under supplied.

In certain places, such as EH2 (Figure 6) there are substantially more Italian resturants than British. It might be worth capitalising on the gap in the market in EH2.

Figure 6 EH2 Number of Stores and Amount of Cuisines in Edinburgh

“Figure 6 EH2 Number of Stores and Amount of Cuisines in Edinburgh”

Interestingly, the restaurants in the city centre (EH1 and EH2) had the lowest rating average rating (Figure 7, lightpatch in centre), which is contrary to intuition: the centre should have the best resturants. This could be explained by many of the restaurants in the city centre having low scores which pull down the overall average. If one had wanted to keep the resturant in a high ratings area (which attracts visitors) then the restaurant should not be placed in the city centre.

Figure 7: Average rating per zip code and number of resturants by cuisine

“Figure 7: Average rating per zip code and number of resturants by cuisine”

The demand a resturant or zip recieves was proxied using the amount of reviews. Note that the number of users could have also been used as an proxy, but was decided against as users can visit a restaurant twice and give two reviews. Figure 8 shows the locations with the most reviews: EH1, EH2, EH3 and EH8. The analysis is focused on these areas.

Figure 8: Number Reviews per Restaurant and zip

“Figure 8: Number Reviews per Restaurant and zip”

stats = ['All','Average annual household income','Average property value']

population = wardsMelt[wardsMelt.indicator == stats[0]]
population.value = pd.to_numeric(population.value)
populationGroup = population.groupby('Zip').value.sum()

income = wardsMelt[wardsMelt.indicator == stats[1]]
income.value = pd.to_numeric(income.value)
incomeGroup = income.groupby('Zip').value.sum()

propertyValue = wardsMelt[wardsMelt.indicator == stats[2]]
propertyValue.value = pd.to_numeric(propertyValue.value)
propertyGroup = propertyValue.groupby('Zip').value.sum()

restaurantGroup = edinburgh.groupby('zip').cusine.count()

    
hotels = pd.read_csv('edinHotels.csv')
hotels['zip'] = hotels.apply(getZip,args=('zip',),axis=1)

hotelsGroup = hotels.groupby('zip').address.count()

statistics = pd.concat([populationGroup/restaurantGroup,incomeGroup/restaurantGroup,propertyGroup/restaurantGroup, hotelsGroup/restaurantGroup,restaurantGroup],1)
statistics.columns = ['pop/rest','income/rest','property/rest','hotels/rest','Num of Rest']
statistics = statistics.dropna()

Metrics for each zip code are shown in table 1. Included are the ratios of: total population to number of restaurants per zip, average income to the number of restaurants per zip, the total number of properties to the number of restaurants per zip, and the total number of hotels to the number of restaurants per zip.

Table 1: Metrics on zip code

“Table 1: Metrics on zip code”

Stating again, the analysis is focused on candidate areas with high demand EH1, EH2, EH3 and EH8. EH8 has the largest ratio of population to restaurants compared to the other candidate areas. However, it also has the lowest number of hotels per restaurant. This suggests that there will be fewer tourists in the area. In contrast, EH3 has high population to restaurant and hotel to restaurant ratios. Furthermore, EH3’s property value to restaurant ratio is low potentially making it cheaper to buy a restaurant there.

3.3 Recommendation

Figure 9: Density Map of Restaurant and Number of Restaurant by Cuising for zip EH3

“Figure 9: Density Map of Restaurant and Number of Restaurant by Cuising for zip EH3”

Figure 9 shows British cuisine is not too saturated in EH3 as there are more Chinese restaurants than British restaurants and also a lot of Indian Italian and Thai restaurants. EH3 is therefore a potentially good area to open a British resturant.

The topic modeling analysis from the first section, found two recommendations 1) that customers are not satisfied with the type of batter for fish and chips in Edinburgh, which was complained to be too soggy and 2) that customers are also not particularly happy with chicken dishes in Edinburgh. Therefore in EH3, the resturant should focus on being really good at making chicken as well as Fish. It is further recommended that “Chicken and Fish” are the only foods offered in the restaurant so that it can specialize in making it good.

4 Attributes and Competition

This section clusters restaurants by their attributes into 5 groups using hierachical clustering and analyses the competition in each zip code to find opportunities by resturant group.

Groups are defined as classy, general, fastfood, stylish pub & bars, and other. A metric of competition is defined for these groups for each zip code to recommend the attributes of the new resturant.

4.1 Attributes

There are a total 1215 restaurants in dataset, with 55 having no attributes at all. Attributes are factors such as casual, classy, open bar, desert, good for groups, good for children, etc. These 1215 resturants are first clustered into 5 different categories.

After considering different distance metrics and linkages, the Euclidean distance and ward linkage was chosen for their simplicity and better structured dendrogram.

#Consider a few different linkage 
df3_filtered=df3.loc[(df3!=0).any(1)]
df3_filtered=df3_filtered.iloc[:,0:58]
R1=hclst(df3_filtered.iloc[:,1:58],"ward","euclidean")
df3_filtered
#To determine cuts by different measures
from scipy.cluster.hierarchy import inconsistent
depth = 20
incons1 = inconsistent(R1[0], depth)
incons1[-20:]
plt.figure(figsize=(25, 10))
plt.title('Inconsistency Measure of last 20 Merges with deth ' + str(depth))
plt.xlabel('Merges')
plt.ylabel('Inconsistency')
plt.plot(incons1[-20:,3])
plt.show()
#Determine cuts By Elbow
plt.figure(figsize=(25, 10))
plt.title('Height of the last 20 Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Height')
last = R1[0][-20:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
plt.show()

plt.figure(figsize=(25, 10))
plt.title('Rate of increase in Height between Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Rate')
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()


from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters
Fig 10: Dendrogram on style clustering

“Fig 10: Dendrogram on style clustering”

Now to determine cuts by different measures.

#To determine cuts by different measures
from scipy.cluster.hierarchy import inconsistent
depth = 20
incons1 = inconsistent(R1[0], depth)
incons1[-20:]
plt.figure(figsize=(25, 10))
plt.title('Inconsistency Measure of last 20 Merges with deth ' + str(depth))
plt.xlabel('Merges')
plt.ylabel('Inconsistency')
plt.plot(incons1[-20:,3])
plt.show()
#Determine cuts By Elbow
plt.figure(figsize=(25, 10))
plt.title('Height of the last 20 Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Height')
last = R1[0][-20:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
plt.show()

plt.figure(figsize=(25, 10))
plt.title('Rate of increase in Height between Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Rate')
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()


from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters
Fig 11: Inconsistency Measure of last 20 merges

“Fig 11: Inconsistency Measure of last 20 merges”

Fig 12: Height of the last 20 merges

“Fig 12: Height of the last 20 merges”

Fig 13: Rate of increase in Height of the last 20 merges

“Fig 13: Rate of increase in Height of the last 20 merges”

From the plots, a cut with 5 clusters gives the most desired result. The each cluster’s attribute count as a percentage of the population is shown below, as barplots.

from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters

#To visualise results
C1=df3_filtered[df3_filtered.cluster==1]
C1=C1.drop(C1.columns[57], axis=1)
C1=C1.describe()
C2=df3_filtered[df3_filtered.cluster==2]
C2=C2.drop(C2.columns[57], axis=1)
C2=C2.describe()
C3=df3_filtered[df3_filtered.cluster==3]
C3=C3.drop(C3.columns[57], axis=1)
C3=C3.describe()
C4=df3_filtered[df3_filtered.cluster==4]
C4=C4.drop(C4.columns[57], axis=1)
C4=C4.describe()
C5=df3_filtered[df3_filtered.cluster==5]
C5=C5.drop(C5.columns[57], axis=1)
C5=C5.describe()

plt.figure(figsize=(25, 10))
plt.title('Cluster 1:Unknown Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C1.ix[1,1:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==1]

Cluster 1

Fig 14: Cluster 1: Unknowns

“Fig 14: Cluster 1: Unknowns”

Cluster 1 contains 382 restaurants. Most of these restaurants do not provide much information for classification.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 2:Classy Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C2.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==2]

Cluster 2

Fig 15: Cluster 2: The Classies

“Fig 15: Cluster 2: The Classies”

Cluster 2 contains 119 restaurants. They are the most expensive and classy type of restaurants (3x$), with classy, romantic, intimate, trendy and upscale atmospheres. The bar plot shows they are characterised by a full bar, dinner service, waiters, and taking reservations. Most of them require customers to be ‘dressy’, and are good for groups and dating. They are generally quiet and some provide background music.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 3:Stylish restaurants and Pubs & Bars')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C3.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==3]

Cluster 3

Fig 16: Cluster 3: The Stylish and Pubs & Bars

“Fig 16: Cluster 3: The Stylish and Pubs & Bars”

Cluster 3 contains 310 restaurants. These are stylish restaurants and pubs & bars priced at level 2 (2x$sign on yelp => $11-30 ), providing mainly dinner and lunch.

These are less quiet places, some can even be loud or very loud. These clusters contain most of the restaurants that provide dj, jukebox, live and video music and also TV broadcasts. Some of these restaurants are quite stylish with hipster, intimate or trendy styles. They are mainly good for groups.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 4: Fastfood & Takeaways')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C4.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==4]

Cluster 4

Fig 17: Cluster 4: The Fastfood and Takeaways

“Fig 17: Cluster 4: The Fastfood and Takeaways”

There are 171 restaurants in cluster 4. This group are the cheapest kind of restaurants ($ sign), probably fastfood restaurants as most of them do not provide waiter service, wine nor have bars. They are associated with being casual, being good for groups, and kids, having an average noise level, and takeout.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 5: General Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C5.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==5]

Cluster 5

Fig 18: Cluster 5: The Generals

“Fig 18: Cluster 5: The Generals”

Cluster 5 contains 233 restaurants. They are general restaurants, providing brunch, dinner and lunch. They are less quiet but not the noisiest type. These restaurants are not very stylish and almost none of them provide any music. They are generally group and kids friendly.

4.2 Competition Analysis

Distribution of Resturant clusters across zip codes

df3_filtered["zip"]=df2["zip"]
df3_filtered.to_csv("C:/Users/cheukkin.Warwick/Desktop/dm_group/d3_filtered.csv")
from __future__ import division
df3_filtered["Type"]=df3_filtered.cluster
df3_filtered.Type=df3_filtered.Type.map({1:"Unknown", 2:"Classy",3:"stylish and Pubs & Bars",4:"Fastfood & Takeaways",5:"General"})
supply=pd.crosstab(df3_filtered["Type"],df3_filtered["zip"]).apply(lambda r: r*100/r.sum(), axis=1)
supply
Table 2: Restaurants distributions across zips

“Table 2: Restaurants distributions across zips”

Table 2 shows the distribution of the same type of restaurants across the zips as a percentage, each row sums up to 100%. For example, 35.6% of classy resturants are located in EH1.


Distribution of good reviews across zip codes by resturant cluster

review=pd.read_csv("C:/Users/cheukkin.Warwick/Desktop/dm_group/final_group_project/Kane Wu/edinReview.csv",header=0)
review=review.merge(df3_filtered, on=['business_id'], how='outer')
review_stars=review[((review.stars==4) == True) | ((review.stars==5) == True)]
top_views=pd.crosstab(review_stars["Type"],review_stars["zip"]).apply(lambda r: r*100/r.sum(), axis=1)
top_views
Table 3: Reviews distributions across zips

“Table 3: Reviews distributions across zips”

Table 3 similarly shows the distribution of good reviews for each restaurant cluster in different zips. Each row sums to 100%. For instance, for fastfood and takeaways 39.7% of 4and 5 star reviews are in from EH1.


Measure of outperformance (or competition) per zip code

(top_views-supply)/supply
Table 4: Ratio of disagreement of percentage distributions of review and restaurant across zips

“Table 4: Ratio of disagreement of percentage distributions of review and restaurant across zips”

A ratio is created to observe the performance (or the competition) of a resturant type per zip code (table 4):

\[competition= \frac{(\text{good reviews%} - \text{restaurants%})}{\text{restaurants%}}\]

It can also be thought of as \(\frac{\text{good reviews%}}{\text{restaurants%}} -1\), which is really asking whether there are more good reviews in a zip code than one would expect as a proportion of the resturants?“. The piviot is 0, where +ve is yes and -ve is no.


4.3 Recommendation

A potential strategy to pursue in choosing the resturant location is to pick a zip code with a high out-performance (or competition) ratio. Since the ratio is a proxy for volumes, customers will be drawn to these places, it could also save advertising costs. However, such a strategy would also be high risk as there will be high set up costs and intense competition in a popular location.

Given that the location analysis already choosen a zip code, EH3, choosing the theme/type of resturant is all that’s left. For EH3, classy has a competition ratio of -0.2, styish bars & pubs has a ratio of -0.2, fastfood & takeaways 0.0, unknown 0.0, and general 0.2. Based off the gap in the markets in EH3, the resturant should consider being either a classy resturant or a stylish bar & pub with good fish & chips and chicken dishes. The latter makes more sense, as the population income at EH3 is not relatively high.

5 Price, Volume and Revenue

To calculate potential revenue, the amount of customers a restaurant may receive was estimated as a multiple of the number of british restaurant reviews. (figure 19) It’s assumed that the number of reviews indicate approximately .1% of total visits for a year. For EH3 there are 161 reviews for British restaurants and 17 British restaurants in total, hence there are 9.47 average number of Reviews per British Restaurant for EH3. There is estimated to be around 9470 customers per year (9.47 * 1000). Note that the check-in dataset could also have been used as an alternative indicator of volumes, but check-ins are based on the amount of offers and so may represent a biased value for analysis.

british = edinburgh[edinburgh.cusine == 'British']
britishGroup = british.groupby('zip').cusine.count()
edinReview = pd.merge(reviews,british, how='inner' ,on="business_id")
reviewGroup = edinReview.groupby(['zip']).size()
grouped = pd.concat([britishGroup,reviewGroup, reviewGroup/britishGroup],axis=1)
Table 5: Average number of Reviews for British Restaurant per Zip

“Table 5: Average number of Reviews for British Restaurant per Zip”

The average price for British Cusine at EH3 was calculated for pricing. As shown in figure19 below the average price range is around 2.53.

Figure 19: Average Price for British Cusine

“Figure 19: Average Price for British Cusine”

Data on prices obtained from the Yelp site is used to calculate expected revenues. The mapping of Yelp price symbols to dollar ($) value is given in table 6.

Table 6: Yelp Symbol to Value

“Table 6: Yelp Symbol to Value”

2.5 is between the second and the third price range so an estimate of 30 pounds of average spend is multiplied by 9470 which is 284,100 pounds per year. This is a fairly conservative estimate or revenues.

6 Summary

To summarise, the demand and location analysis suggested an appropriate place to set up the resturant is EH3, a high volume area with an under supplied market for British cuisine. The competition and theme analysis, gave additional information by providing a metric of competition (the market in EH3 had competition metrics close to zero, implying fair competition in general), and suggested that there was low competition for stylish bars & pubs. The topic analysis of reviews concluded that making good fish & chips and chicken dishes could be a potentially successful endeavour, because this was found to be an area which people frequently complained about. A potential point to avoid is making the stylish bar/pub with food too quiet, as customers had complained about that in reviews. Also the table service and waiters would be polite and speedy, and the room tempreture should be monitored.

The revenue analysis helped to set a price range for the resturant and bar at a medium price, 30 pounds, and estimated 9470 customers a year for a revenue of 284,100.

7 Reference

[1] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(8):1548–1560, 2011.

[2] J. Choo, C. Lee, C. K. Reddy, and H. Park. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics (TVCG), 19(12):1992–2001, 2013

[3] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Wiley, 2009.

[4] Edinburgh ward and locality demographics. Sourced on June 9th 2016 from http://www.edinburgh.gov.uk/info/20247/edinburgh_by_numbers/1393/locality_and_ward_data_profiles

[5] City of Edinburgh Mapping Portal. Source on June 9th 2016 from http://data.edinburghcouncilmaps.info/datasets/2cee9b18a21344b0879c3c51d71fd2c6_28

[6] Yelp price symbol mapping to real values. Source on June 14th 2016 from http://www.yelp.com/topic/san-diego-can-anyone-give-me-the-actual-dollar-range-for-the-dollar-sign-symbols-in-rrgards-to-pricing